Enormous volumes of short-read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data because of the challenging problem of short-read assembly. Thus, alignment-free methods need to be developed for the comparison of NGS samples of short reads. Recently, new $k$-mer based distance measures such as {\it CVTree}, $d_{2}^{S}$, and {\it co-phylog} have been proposed to address this problem. However, those distances depend considerably on the parameter $k$, and choosing the optimal $k$ is not trivial since it may depend on different aspects of the sequence data. Hence, in this paper we consider an alternative parameter-free approach: compression-based distance measures. These measures have shown impressive performance on long genome sequences in previous studies, but they have not been tested on NGS short reads. In this study we perform extensive validation and show that the compression-based distances are highly consistent with the distances obtained from the $k$-mer based methods, from the alignment-based approach, and from existing benchmarks in the literature. Moreover, because these measures are parameter-free, no optimization is required, and they still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. The compression-based distance measures are assembly-free, alignment-free, and parameter-free, and thus represent useful tools for the comparison of long genome sequences and NGS samples of short reads.
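To make the idea of a compression-based distance concrete, the following is a minimal sketch of one widely used measure of this kind, the normalized compression distance (NCD). The abstract does not state which compression-based measures or which compressor are used, so the choice of NCD, the use of Python's `zlib`, and the helper names `compressed_size`, `ncd`, and `sample_to_bytes` are all assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the input after zlib compression at level 9."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    inputs; values near 1 suggest unrelated inputs.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def sample_to_bytes(reads: list[str]) -> bytes:
    """Hypothetical helper: join the reads of one sample, one per line.

    No assembly or alignment is performed; the sample is treated as a
    plain concatenation of its reads.
    """
    return "\n".join(reads).encode("ascii")
```

A pairwise distance between two samples could then be computed as `ncd(sample_to_bytes(reads_a), sample_to_bytes(reads_b))`. Note that `zlib` only looks back over a 32 KiB window, so it cannot detect repeats across realistically sized samples; it is used here only to keep the sketch dependency-free, and a compressor with a much larger context would be needed in practice.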